REMARKS

Claims 1-8 were rejected under 35 U.S.C. 101 as being directed to non-statutory subject 
matter. Responsive to the Examiner's suggestion for claim amendments to address this issue, 
Applicants have amended claim 1 to recite "outputting the identified group of genes having 
characteristic values greater than the threshold in a selected data format." Withdrawal of the 
Section 101 rejection of claim 1 is requested. 

No prior art rejection has been asserted against dependent claim 2. Claim 1 has further 
been amended to include the limitations of objected-to dependent claim 2. Claim 1 is 
accordingly in condition for favorable action and allowance. All claims depending from claim 1 
are also in condition for allowance. 

With respect to withdrawn claims 9-12, Applicants request rejoinder as these claims 
depend from claim 5 which is now in condition for allowance (and since claim 5 depends from 
claim 1). 

Claims 13 and 14 were rejected under 35 U.S.C. 103(a) as being unpatentable over 
Shamir in view of Dougherty and Tolley. 

Claim 13 has been amended to recite "the processing sub-system considering all possible 
pairs of generated sub-tables and generating signals, for each pair of sub-tables, representing 
characteristic parameters of data associated to genes of that pair of sub-tables." The claimed 
operation considers each pair of sub-tables (i.e., a cluster pair) for purposes of determining 
whether the genes in that cluster pair are "members of a network of genes likely to be involved in 
a particular cellular process." This is accomplished by having the processing sub-system 
consider "correlation among and between the included genes" with the cluster pair. The 
"intelligent sub-system" receives this correlation information in the form of the "signals 
representative of characteristic parameters" and generates "for each pair of sub-tables a 
characteristic value determined as a function of the characteristic parameters." If that 
"characteristic value is greater than a certain pre-established threshold," then the genes within the 
associated cluster pair are considered to be "members of a network of genes likely to be involved 
in a particular cellular process." 

There is no teaching or suggestion in the cited prior art for "considering all possible pairs 
of generated sub-tables" in the manner claimed. Still further, there is no teaching or suggestion 
in the cited prior art for generating, in connection with each pair of sub-tables (a cluster pair), 
"signals representative of characteristic parameters" which reflect "correlation among and 
between the included genes" with the cluster pair. Still further, there is no teaching or suggestion 
in the cited prior art for generating "for each pair of sub-tables a characteristic value determined 
as a function of the characteristic parameters" for that cluster pair. Still further, there is no 
teaching or suggestion in the cited prior art for comparing the characteristic value to a certain 
pre-established threshold and identifying the genes within the cluster pair as "members of a 
network of genes likely to be involved in a particular cellular process" if the threshold is 
exceeded. 
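
By way of illustration only, the pairwise operation recited in amended claim 13 can be sketched as follows. The sketch assumes that the "characteristic parameters" are pairwise Pearson correlations between expression profiles and that the "characteristic value" is their mean absolute value; both choices, and all function names, are hypothetical and are not taken from the claims or the specification.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation information
    return sxy / sqrt(sxx * syy)


def characteristic_value(cluster_a, cluster_b, profiles):
    """Hypothetical characteristic value: mean |correlation| over all
    gene pairs among and between the two clusters."""
    genes = sorted(set(cluster_a) | set(cluster_b))
    pairs = list(combinations(genes, 2))
    if not pairs:
        return 0.0
    total = sum(abs(pearson(profiles[g], profiles[h])) for g, h in pairs)
    return total / len(pairs)


def candidate_networks(clusters, profiles, threshold):
    """Consider all possible pairs of clusters and keep those whose
    characteristic value is greater than the threshold."""
    hits = []
    for (i, a), (j, b) in combinations(enumerate(clusters), 2):
        value = characteristic_value(a, b, profiles)
        if value > threshold:
            hits.append((i, j, value))
    return hits
```

Unlike a merging step, this sketch never combines two clusters into one; it only reports which cluster pairs exceed the threshold.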

The Examiner cites to Shamir, and more particularly Hartuv which is cited within 
Shamir, in support of rejecting claim 13. Applicants respectfully submit that the Examiner does 
not fully understand how the Hartuv process operates. With such an understanding, Applicants 
submit that the Examiner will find the claimed operation to be distinct from Shamir/Hartuv. 

To assist the Examiner in understanding Hartuv, Applicants attach hereto a copy of the 
Hartuv article cited by Shamir. The Hartuv algorithm is understood by Applicants to work as 
follows: 

In the Hartuv method (page 251 left col. lines 31-32 from top), possible connections 
between clones are graphically represented with a graph $G_\theta$ in which each vertex corresponds to 
a clone and the connection between two vertices i and j is represented by (see, page 251 left col. 
lines 24-25) the intensity level of the hybridization of clone i with probe j. According to Hartuv 
(see, page 251 left col. lines 40-44), "a group of clones originating from the same gene should 
form a subgraph with a high connectivity value. In contrast, subgraphs formed by clones from 
different genes should have lower connectivity." Highly connected subgraphs (see, page 251 left 
col. fifth last line) are identified as clusters and are isolated from the whole graph (see, page 251 
left col. last line to right col. second line) by cutting a minimum number of edges. Subgraphs are 
recognized as being highly connected (see, page 251 left col. eighth to sixth last lines; and also 
US 2003/0224344 par. 15) if their connectivity value is larger than a threshold. 

Hartuv recognizes that the basic embodiment of his algorithm (see, page 251 right col. 
lines 9-11) may leave certain vertices (clones) as unclustered singletons or (see, page 251 right 
col. line 30) may split (in two or more parts presumably) a group of genes that should have been 
identified as belonging to a same cluster. In practice, Hartuv indicates (page 251 right col. par. 
"Iterated HCS") that the step of removing edges from the original graph may lead to the 
generation of artificially separate subgraphs, and thus to an erroneous splitting of a single 
group of clones into two (or more) different groups. For this reason, a further refinement step 
("cluster merging") is carried out. 

Denomination of this refinement step as "cluster merging," however, appears to be 
improper because the aim of this step is not to combine two truly distinct clusters, that is two 
subgraphs that in the original graph were not connected, but rather to correct (or rather re- 
connect) possible artificial separations of a group of clones that was to be recognized as 
constituting a same cluster. The output of the refined embodiment of the Hartuv method is a list 
of clusters of clones (see, US2003/0224344, par. 15 right col. first two lines), each cluster 
corresponding in the original graph to a respective highly connected subgraph. 

Thus, Hartuv teaches performing a merging operation with respect to clusters in order to 
identify and combine those clusters which were improperly separated and are not distinct from 
each other. In fact, these clusters are connected on the original graph and belong to the same 
cluster (but were artificially separated, as described above). In essence, the merging 
operation of Hartuv is designed to correct for an earlier introduced error in the process. 

The cluster pairing performed in the claimed invention, however, is distinct from the 
Hartuv process. In the claimed invention, all possible pairs of clusters are considered. Hartuv 
does not teach this operation. Still further, Applicants claim that the cluster pairs are processed 
in order to determine whether the genes in the cluster pair can be identified as being part of a 
gene network ("members of a network of genes likely to be involved in a particular cellular 
process"). There is no teaching or suggestion in Hartuv of making such an analysis or coming to 
such a conclusion. 

Applicants further submit that Shamir/Hartuv fails to teach or suggest pairing clusters for 
the claimed purpose of extracting characteristic parameters which express correlation among and 
between the included genes. Rather, Shamir/Hartuv form cluster pairs in order to refine the 
previously performed clustering operation (in other words, in order to make or form more 
accurate clusters). The point of the Hartuv process is to merge clusters together to become a 
single cluster. Applicants do not perform such a merger. Rather, clusters are paired together and 
then the pair is processed based on characteristic parameters (regarding correlation between the 
genes in the cluster pair) and a characteristic value (determined from the parameters of the 
cluster pair) in order to determine whether the genes in the cluster pair are members of a network 
of genes likely to be involved in a particular cellular process. 

The Shamir/Hartuv reference simply teaches a clustering process. This is analogous to 
the recitation in claim 13 of a "pre-processing sub-system input with data of a table relative to 
gene expressions variable with time and/or different environmental conditions, the pre- 
processing sub-system generating sub-tables of data in groups of genes that satisfy a pre- 
established clustering criterion." The claimed invention goes well beyond simple clustering. 
This is emphasized by Applicants at paragraph 11 of the specification which recites that the 
invention determines complex relationships among genes "that go beyond the simple clustering 
operations of the known methods." An example of such complex relationships is given in 
paragraph 12 of the specification. The Hartuv process may, after clustering is completed, have 
identified genes in different clusters as being in separate unrelated groups. The present 
invention, by then further considering pairs of clusters and the genes included therein, can find a 
gene network in the genes of the cluster pair for which Hartuv would not have found any 
relationship. It is in this way that the claimed invention differs from Shamir/Hartuv. 

New claim 15 has been added. This claim is believed to be patentable over the cited art 
for at least the reasons recited above. 

New claim 16 has been added. This claim is believed to be patentable over the cited art 
for at least the reasons recited above. Claim 16 further includes limitations relating to filter data 
and the establishment of pair combinations of cluster pair, filter data pair and cluster/filter pair. 
The pair combinations are processed to determine a characteristic value based on correlation 
parameters, with the characteristic value being compared to a threshold to determine whether the 
genes in the pair combination are members of a network of genes likely to be involved in a 
particular cellular process. There is no teaching or suggestion in the cited prior art for the cluster 
pair, filter data pair and cluster/filter pair processing as claimed. 

New claim 17 has been added. This claim is believed to be patentable over the cited prior 
art for at least the reasons recited above with respect to claim 13. 

In view of the foregoing, Applicants respectfully submit that the application is in 
condition for favorable action and allowance. 

Dated: June 21, 2007 Respectfully submitted,

Clustering large data sets is a central challenge in 
gene expression analysis. The hybridization of syn- 
thetic oligonucleotides to arrayed cDNAs yields a fin- 
gerprint for each cDNA clone. Cluster analysis of these 
fingerprints can identify clones corresponding to the 
same gene. We have developed a novel algorithm for 
cluster analysis that is based on graph theoretic tech- 
niques. Unlike other methods, it does not assume that 
the clusters are hierarchically structured and does not 
require prior knowledge on the number of clusters. In 
tests with simulated libraries the algorithm outper- 
formed the Greedy method and demonstrated high 
speed and robustness to high error rate. Good solution 
quality was also obtained in a blind test on real cDNA 
fingerprints. © 2000 Academic Press 



INTRODUCTION 

Information on the expression levels of genes under 
various conditions is key to elucidating their function. 
One way to measure gene expression levels is by sam- 
pling cDNAs from the tissue and measuring the 
amount of cDNA of each gene in the sample. If the 
cDNAs are picked at random, the abundance of cDNAs 
extracted indicates the relative expression levels of 
their genes. 

Out of about 100,000 different human genes, the 
number of genes active in a human cell at any time is 
over 10,000 (Kinzler et al., 1997). The relative abun- 
dance of cDNAs of different genes may vary by a factor 
of 10,000. This clearly indicates that the size of the 
sample of cDNAs that must be extracted from a cell to 
obtain adequate representation of low-abundance 
genes must be on the order of 100,000 or more. 

Sequencing some 100,000 cDNAs per sample is slow 
and prohibitively expensive. It is also very inefficient, 
as the high-abundance transcripts are resequenced 
again and again. Normalized libraries do not fully 
overcome this redundancy (Bonaldo et al., 1996). 
Oligo-fingerprinting was developed as an alternative 
approach (Lennon and Lehrach, 1991; Crkvenjakov et al., 
1991; Vicentic et al., 1992; Drmanac and Drmanac, 
1994; Drmanac et al., 1996; Meier-Ewert et al., 1994; 
Milosavljevic et al., 1995). It is based on spotting 
poly (dT) -primed cDNAs on high-density filters. When a 
short synthetic oligonucleotide probe hybridizes with 
the filter under stringent conditions, one obtains a 
positive hybridization signal with all cDNA clones that 
contain a DNA sequence complementary to that of the 
probe. By repeating the experiment with different 
probes, one obtains for each clone a fingerprint vector, 
indicating its hybridization level with each probe. 
cDNAs originating from the same gene have similar 
oligomer contents and thus should have similar fingerprints. 
Based on the fingerprints, one can devise computer 
algorithms to identify and cluster cDNAs originating 
from the same gene. As a result, ideally, only 
one cDNA will have to be sequenced from each cluster, 
and the cluster size will tell the abundance of its gene. 
Good algorithms must overcome practical difficulties, 
which include error-prone fingerprints and substantial 
variability in cDNA length (Meier-Ewert et al., 1998). 

Alternative technologies such as DNA microchips 
(Fodor et al., 1993; Schena et al., 1996) have the advantage 
of being able to determine the expression levels 
of thousands of genes in parallel through a single 
hybridization. However, they are not applicable in all 
cases, as their application requires that the gene/ORF 
ensemble be known exactly in advance, and much 
larger amounts of tissue RNA are required for a single 
analysis. While the sensitivity is increasingly im- 
proved, the oligo-fingerprinting approach is currently 
one of the most effective strategies for the analysis of 
novel genes or organisms with few identified genes. 
Large sequencing projects have successfully identified 
the majority of human genes and will undoubtedly do 
the same for a limited number of model organisms. 
However, it is highly unlikely that the same resources 
will be made available in the near future for sequencing 
other important model genomes such as Amphioxus, 
sea urchin, and a number of plants and fungi. 
For such species, oligofingerprinting is an effective 
means to large-scale gene discovery. 

We present here a new algorithm for clustering, 
called HCS (abbreviation of highly connected sub- 
graphs). The algorithm has been tested intensively on 
simulated data and was shown to give good results 
even in the presence of relatively high levels of noise. It 
was also shown to outperform a central algorithm that 
has been used for expression analysis (Milosavljevic et 
al., 1995). In a blind test on a real data set consisting of 
2329 cDNAs, the algorithm achieved very good results. 

Several graph theoretic approaches to cluster anal- 
ysis have been suggested (see, e.g., Mirkin, 1996; 
Matula, 1972; Hansen and Jaumard, 1997). Those in- 
clude finding connected components, strongly con- 
nected components in directed graphs, cliques, and 
maximal cliques. For a critique of these approaches, 
see Matula (1970). Our approach is different and is 
based on repeated minimum-cut computations. Two 
other approaches that are more similar to ours were 
proposed by Matula (1969, 1970, 1972, 1977) and by 
Wu and Leahy (1993). Both of these approaches lack 
our important stopping criterion, with the ensuing 
provable results on the clustering quality. In particu- 
lar, these algorithms do not guarantee a key property 
of HCS solutions: clusters have diameter 2; i.e., two 
elements in the same cluster have a high degree of 
similarity to each other or to a common third member 
of that cluster. Wu and Leahy's algorithm requires that 
the number of clusters be known in advance, and all 
the cuts are computed with respect to the same original 
graph. For other limitations of these methods, see Har- 
tuv (1998). 

For the specific problem of clustering cDNA finger- 
prints, several approaches were suggested previously: 
Drmanac et al. (1996) build clusters around connected 
components in the similarity graph. In that graph, 
vertices correspond to cDNAs and edges correspond to 
pairs whose similarity is above a threshold $\theta$ (see Materials 
and Methods for definitions). Even with a low 
false-positive rate in the data, such an algorithm would 
incorrectly connect true clusters. The way to avoid this 
is by increasing sharply the threshold for similarity, 
which causes the splitting of many clusters. Meier-Ewert 
et al. (1998) and Poustka et al. (1999) build 
clusters by computing all maximal cliques and merging 
maximal cliques with sufficiently large overlap into a 
single cluster. Computing all maximal cliques is com- 
putationally intractable (Garey and Johnson, 1979). 
Moreover, a high false-negative rate may break large 
clusters into many maximal cliques with a complicated 
and hard-to-detect overlap structure. Milosavljevic et 
al. (1995) build clusters using the Greedy algorithm. In 
each step a new seed clone is chosen, and all clones that 
are sufficiently similar to the seed are added to its 
cluster and removed from the data set. To merge 
falsely separated clusters, the algorithm is run twice: 
Before the second phase, an average fingerprint is com- 
puted for each cluster, and the fingerprints of all the 
clones in that cluster are replaced by it. Like most 
Greedy approaches, this algorithm cannot handle 
high noise levels well, and the quality of its results is very 
sensitive to the starting point. Meier-Ewert et al. 
(1998) use an algorithm that combines ideas from the 
two previous methods. Our approach is completely dif- 
ferent from all previous cDNA clustering methods. We 
shall show real data and simulation results that dem- 
onstrate the superiority of our algorithm over the 
Greedy algorithm. 

MATERIALS AND METHODS 

Library and Fingerprint Construction 

The 2329 cDNAs originating from 18 genes used in testing the 
clustering algorithm were part of a library of some 100,000 cDNAs 
prepared from purified peripheral blood monocytes. These were iso- 
lated, and the cDNA was synthesized by Will Phares (Novartis 
Forschungsinstitut, Vienna, Austria). cDNAs were synthesized by 
oligo(dT) priming, cloned into the plasmid vector pSPORT1 (Life 
Technologies) and transformed into DH10B Escherichia coli cells 
(Life Technologies). Individual colonies were plated out on agar 
plates and picked into microtiter plates using a Q-bot robot (Genetix, 
Dorset, UK). A total of 100,000 primary cDNA clones were picked 
and then arrayed at high density onto nylon membranes using the 
same Q-bot device. 

High-density arrays were then hybridized with whole cDNA clones 
(also oligo(dT) primed, average length ~1000 nt) whose identity had 
previously been determined by sequencing (see Table 1). Probes were 
labeled by random hexamer priming using [a-^PJdATP. Hybridiza- 
tions were carried out in 0.5 M sodium phosphate, pH 7.2, 7% SDS, 
at 65 °C for at least 3 h and then washed twice at high stringency in 
40 mM sodium phosphate, pH 7.2, 0.1% SDS, at 65°C, for 30 min. 
Positive hybridization signals were scored on three intensity levels, 
and only those with the highest scores were considered for the 
purpose of this analysis. By this approach we were able to identify 
2329 clones representing the 18 selected gene sequences in the entire 
library of 100,000 cDNAs. Hybridized membranes were exposed to 
phosphor storage screens and subsequently scanned using a PhosphorImager 
(Molecular Dynamics, Sunnyvale, CA). Resulting hybrid- 
ization image files were then analyzed, and positive clones were 
identified using custom-written software (VisualGrid, available as 
free download from http://www.gpc-ag.com). 

Oligonucleotide fingerprints were generated as described in Meier- 
Ewert et al. (1998), by successive hybridization of 139 decamer 
oligonucleotides. Each decamer oligonucleotide probe was in fact a 
pool of 16 decamers that share a common 8 nt core sequence (i.e. 
NxxxxxxxxN, where N is any nucleotide, and xxxxxxxx is the specific 
core sequence). The fingerprints of all the cDNAs that were positive 
in the back hybridizations with the 18 gene-specific probes were then 
input into the clustering algorithm. 

Hybridization Data Preprocessing 

Hybridization intensities were renormalized as described in 
Meier-Ewert et al. (1998). From the fingerprints, a real-valued matrix 
$S$ was formed, containing similarity values between all pairs of 
cDNAs, with values ranging from 3.42 to 139. The graph $G_\theta$ used by 
the algorithm had a vertex for each clone and an edge connecting two 
clones if and only if their similarity exceeded $\theta = 110$ (see details 
below). 

Simulation Set-Up 

The simulation process receives as input the following parameters: 
The number of genes in the experiment is $N$. Gene $i$ has $C_i$ copies, 
so $C_i$ is the true size of the cluster of gene $i$. Hence, the total number 
of clones in the simulation is $n = \sum_{i=1}^{N} C_i$. $L_a$ and $L_b$ are the 
minimum and maximum possible lengths, respectively, of a cDNA. Clone 
lengths are generated according to a normal distribution with mean 
$\mu = (L_a + L_b)/2$ and standard deviation $\sigma = (L_b + L_a)/6$. The number 
of probes is $p$. Probes are assumed to occur along a gene with Poisson 
distribution with rate $\lambda$. This assumption was originally suggested in 
Michiels et al. (1987) and was adopted by other researchers (Alizadeh 
et al., 1995; Platt and Dix, 1997; Mayraz and Shamir, 1999). The 
probability that an oligonucleotide occurrence did not register 
(false-negative hybridization probability) is $\alpha$. False-positive 
hybridizations are assumed to have Poisson distribution with rate $\beta$. All probe 
occurrences and error events are assumed to be independent. The 
result of the simulation is an $n \times p$ hybridization matrix $H$, in which 
$H_{ij} = 1$ if clone $i$ hybridized with probe $j$ and $H_{ij} = -1$ otherwise. 
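
The set-up above might be sketched roughly as follows. Several details are assumptions made only for illustration and are not stated in the text: clones are modelled as 3'-anchored fragments of their gene, probe sites are placed uniformly along the gene, false positives are drawn independently per clone-probe pair, clone lengths are clamped to the allowed range, and all function and parameter names are hypothetical.

```python
import math
import random


def poisson(rate, rng):
    """Sample a Poisson variate with Knuth's multiplication method
    (adequate for the small rates used here)."""
    limit, k, prod = math.exp(-rate), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate(copies, n_probes, l_min, l_max, lam, alpha, beta, seed=0):
    """Return (H, labels): an n x p matrix of +1/-1 entries and the
    true gene index of each clone. copies[i] is the cluster size C_i."""
    rng = random.Random(seed)
    mu = (l_min + l_max) / 2
    sigma = (l_min + l_max) / 6  # as printed in the text above
    H, labels = [], []
    for gene, n_copies in enumerate(copies):
        # Assumed model: probe occurrence positions along the gene,
        # measured from the 3' end, at rate lam per base.
        sites = [
            [rng.uniform(0, l_max) for _ in range(poisson(lam * l_max, rng))]
            for _ in range(n_probes)
        ]
        for _ in range(n_copies):
            # Clamp the normally distributed length to [l_min, l_max].
            length = min(l_max, max(l_min, rng.gauss(mu, sigma)))
            row = []
            for probe_sites in sites:
                # Each true occurrence inside the clone registers
                # with probability 1 - alpha.
                true_hit = any(
                    pos < length and rng.random() >= alpha
                    for pos in probe_sites
                )
                false_hit = poisson(beta, rng) > 0
                row.append(1 if true_hit or false_hit else -1)
            H.append(row)
            labels.append(gene)
    return H, labels
```

For example, `simulate([3, 2], 5, 500, 2000, 0.0005, 0.1, 0.05)` yields a 5 x 5 matrix for two genes with three and two clones.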

We note that the probabilistic assumptions are used only to gen- 
erate the data for the simulations and are not used by the HCS 
algorithm. The only assumption that the algorithm makes is that 
fingerprints of clones from the same cluster tend to have higher 
similarity values. 

The HCS Clustering Algorithm 

The input to the clustering process is the $n \times p$ hybridization 
matrix $H$, in which rows correspond to the cDNA clones in the 
library, and columns correspond to the probes. $H_{ij}$ is the intensity 
level of the hybridization of clone $i$ with probe $j$. The $i$th row, $H_i$, is 
the fingerprint of the $i$th clone. Since cDNAs that originate from the 
same gene have similar fingerprints, a good clustering algorithm 
should form a partition in which each cluster contains cDNAs 
originating from the same gene. The matrix $H$ is used to form the $n \times n$ 
similarity matrix $S$, where $S_{ij} = \sum_k H_{ik} \cdot H_{jk}$. For a real value $\theta$, the 
similarity graph $G_\theta$ is a graph with vertices corresponding to the 
clones and edges connecting clones whose pairwise similarity is at 
least $\theta$ (Matula, 1977; Mirkin, 1996). 
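
As a minimal sketch of this construction, the similarity matrix and the thresholded graph can be computed as below. The adjacency-set representation and the function names are illustrative choices, not taken from the paper.

```python
def similarity_matrix(H):
    """S[i][j] = sum over probes k of H[i][k] * H[j][k]."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in H] for ri in H]


def similarity_graph(S, theta):
    """Adjacency sets of the similarity graph: clones i != j are
    joined by an edge when S[i][j] is at least theta."""
    n = len(S)
    return {
        i: {j for j in range(n) if j != i and S[i][j] >= theta}
        for i in range(n)
    }
```

With fingerprints `[[1, 1, -1], [1, 1, -1], [-1, -1, 1]]` and a threshold of 2, only the first two clones are connected.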

We provide some standard graph theoretic terminology needed to 
describe the algorithm: A cut in a graph is a set of edges whose 
removal disconnects the graph. The connectivity of a graph $G$, 
denoted $k(G)$, is the minimum size of a cut in $G$. A minimum cut is a cut 
with minimum number of edges. Several algorithms are available for 
efficient computation of a minimum cut (Ahuja et al., 1993). Our 
algorithm is based on the following key observations: A group of 
clones originating from the same gene should form a subgraph with 
a relatively high connectivity value. In contrast, the subgraph 
formed by clones from different genes should have lower connectivity. 

The clustering algorithm works on the similarity graph. Had the 
similarity graph represented the cluster structure perfectly, each 
cluster would have been a clique, as all members of a cluster are 
highly similar, and no two clusters would be connected by an edge, as 
elements from distinct clusters are supposed to be dissimilar. In 
reality, searching for cliques in the graph would fail in two ways: 
First, finding maximum cliques is computationally intractable 
(Garey and Johnson, 1979). Second, and more important, real hy- 
bridization matrices contain many errors. Errors in the hybridiza- 
tion data generate inexact fingerprints, leading in turn to errors in 
the similarity graph: missing edges between vertices in the same 
cluster and false (extra) edges between vertices in different clusters. 
Our algorithm was designed to withstand high error rate and work 
in low-degree polynomial time. 

Below we give a high-level description of the algorithm. The detailed 
exposition and proofs of the algorithm's mathematical properties 
are given elsewhere (Hartuv and Shamir, submitted for publication). 
A key definition for our approach is the following: A graph $G$ 
with $n > 1$ vertices is called highly connected if $k(G) > n/2$. A highly 
connected subgraph is an induced subgraph that is highly connected. 
Our algorithm identifies highly connected subgraphs as clusters. The 
basic HCS algorithm is recursive: On an input graph $G$, it determines 
whether that graph is highly connected by computing a minimum 
cut in $G$. If $G$ is highly connected, then its vertices form a 
cluster and the algorithm halts. Otherwise, the edges of a minimum 
cut are removed, forming two connected components, and the algorithm 
continues recursively on each component. 
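
The recursion just described can be sketched as follows. This is not the authors' LEDA implementation: the minimum cut is computed here with a plain Stoer-Wagner procedure on adjacency sets, vertices that end up alone are returned as one-element sets, and the function names are illustrative.

```python
def minimum_cut(adj):
    """Stoer-Wagner minimum cut of an unweighted graph given as
    {vertex: set(neighbours)}. Returns (cut_size, one_side)."""
    w = {u: {v: 1 for v in nbrs} for u, nbrs in adj.items()}
    members = {u: {u} for u in adj}  # original vertices per super-vertex
    best_size, best_side = float("inf"), set()
    while len(w) > 1:
        # One phase: grow a set A in maximum-adjacency order.
        start = next(iter(w))
        weight_to_a = {v: w[start].get(v, 0) for v in w if v != start}
        prev, last, cut_of_phase = None, start, 0
        while weight_to_a:
            nxt = max(weight_to_a, key=weight_to_a.get)
            cut_of_phase = weight_to_a.pop(nxt)
            prev, last = last, nxt
            for v, wt in w[nxt].items():
                if v in weight_to_a:
                    weight_to_a[v] += wt
        # The cut of the phase separates `last` from everything else.
        if cut_of_phase < best_size:
            best_size, best_side = cut_of_phase, set(members[last])
        # Merge `last` into `prev`.
        for v, wt in w.pop(last).items():
            del w[v][last]
            if v != prev:
                w[prev][v] = w[prev].get(v, 0) + wt
                w[v][prev] = w[prev][v]
        members[prev] |= members.pop(last)
    return best_size, best_side


def hcs(adj):
    """Basic HCS: return a list of vertex sets. A graph with n > 1
    vertices is a cluster when its connectivity exceeds n / 2."""
    n = len(adj)
    if n == 0:
        return []
    if n == 1:
        return [set(adj)]  # unclustered singleton
    size, side = minimum_cut(adj)
    if size > n / 2:
        return [set(adj)]
    clusters = []
    for part in (side, set(adj) - side):
        # Induced subgraph: dropping cross edges removes the cut.
        sub = {u: adj[u] & part for u in part}
        clusters.extend(hcs(sub))
    return clusters
```

Note that under the definition above a single edge between two vertices is not highly connected (connectivity 1 is not greater than 2/2), so such a pair is split into two singletons.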

The running time of the basic HCS algorithm is bounded by $2N \times 
f(n, m)$, where $f(n, m)$ is the time complexity of computing a minimum 
cut in a graph with $n$ vertices and $m$ edges, and $N$ is the number of 
clusters. Typically $N \ll n$. The best deterministic time bound known 
for $f(n, m)$ is $O(nm)$ (Matula, 1987; Nagamochi and Ibaraki, 1992). 

Improvements 

Singleton adoption. The basic HCS algorithm may leave certain 
vertices as unclustered singletons. Subsequently, each singleton is 
checked to see whether it fits into one of the clusters. For each singleton $x$, 
we compute the number of neighbors it has in each cluster and in the 
singleton set $S$. If the maximum number of neighbors is sufficiently 
large, and is obtained by a cluster (and not by $S$), then $x$ is added to 
that cluster. The process is repeated up to $J$ times ($J = 50$ was used 
in practice) to accommodate the changes in clusters as a result of 
previous adoptions. 
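
A minimal sketch of this adoption step is given below. The text does not say what counts as "sufficiently large", so the `min_neighbors` parameter is a hypothetical stand-in, and stopping early once a round adopts nothing is an added convenience rather than part of the description.

```python
def adopt_singletons(adj, clusters, singletons, min_neighbors, max_rounds=50):
    """Move singletons into the cluster where they have the most
    neighbours, provided that count is at least min_neighbors and
    larger than their neighbour count within the singleton set."""
    clusters = [set(c) for c in clusters]
    singletons = set(singletons)
    if not clusters:
        return clusters, singletons
    for _ in range(max_rounds):
        adopted = False
        for x in sorted(singletons):
            in_singletons = len(adj[x] & (singletons - {x}))
            counts = [len(adj[x] & c) for c in clusters]
            best = max(range(len(clusters)), key=counts.__getitem__)
            if counts[best] >= min_neighbors and counts[best] > in_singletons:
                clusters[best].add(x)
                singletons.discard(x)
                adopted = True
        if not adopted:
            break  # later rounds would not change anything
    return clusters, singletons
```

Repeating the pass matters because a vertex adopted in one round changes the neighbor counts seen by the remaining singletons in the next.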

The low-degree heuristic. When the input graph contains low-degree 
vertices, initial minimum cut iterations will separate them 
one by one from the rest of the graph. Removing low-degree vertices 
from $G_\theta$ eliminates such noninformative iterations and significantly 
reduces the running time. To utilize this idea, the algorithm receives 
as input a degree sequence, i.e., a decreasing sequence of integers 
$d_1, d_2, \ldots, d_t$, and performs $t$ major iterations. In major iteration $i$, we 
remove all vertices with degrees below $d_i$ and then apply the HCS 
algorithm, followed by singleton adoption. All clustered vertices are 
set aside, and the next major iteration is applied to the remaining 
graph, using the smaller value $d_{i+1}$. 

Cluster merging. To overcome the possibility of cluster splitting, 
we applied a final cluster-merging step. This step uses the raw 
fingerprints and was implemented as described in Milosavljevic et al. 
(1995), to facilitate a comparison of the two algorithms. Specifically, 
an average fingerprint is computed for each cluster, and clusters that 
have highly similar fingerprints are merged. 
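
The merging step might look roughly like the sketch below. The text only says that clusters with "highly similar" average fingerprints are merged, following Milosavljevic et al. (1995); the cosine similarity, the `min_similarity` threshold, and the restart-after-each-merge loop used here are assumptions for illustration, not that paper's procedure.

```python
from math import sqrt


def mean_fingerprint(cluster, H):
    """Column-wise average of the fingerprints of a cluster's clones."""
    rows = [H[i] for i in cluster]
    return [sum(col) / len(rows) for col in zip(*rows)]


def cosine(u, v):
    """Cosine similarity; an assumed stand-in for the similarity test."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def merge_clusters(clusters, H, min_similarity):
    """Repeatedly merge the first pair of clusters whose average
    fingerprints are at least min_similarity alike."""
    clusters = [set(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        means = [mean_fingerprint(c, H) for c in clusters]
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if cosine(means[i], means[j]) >= min_similarity:
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break  # averages changed, so recompute them
    return clusters
```

Recomputing the averages after every merge keeps each comparison based on the current cluster contents.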

Iterated HCS. When there are several cuts attaining the mini- 
mum value in the current subgraph, the minimum cut algorithm 
chooses one arbitrarily. This may cause some splitting of small 
clusters into singletons. The HCS algorithm can then be reapplied on 
the subgraph induced by unclustered elements. Iterating this proce- 
dure several times overcomes the problem. 

Implementation. The simulation algorithm was written in MATLAB. 
HCS was written in C++ within the LEDA 3.4.1 environment 
(Mehlhorn and Näher, 1995). The minimum-cut algorithm implemented 
in LEDA has an $O(nm + n^2 \log n)$ time complexity (Stoer and 
Wagner, 1997). Average elapsed time on a 194 MHz SGI Challenge L 
machine with 32 kB instruction cache and 1024 kB main memory 
was about 43 min for the 2329 clones data set (see Results). Clus- 
tering of another 7800 elements in simulation with slightly lower 
noise levels required only 6 min. 

RESULTS 

Clustering Real cDNA Data 

We tested the algorithm in a blind test on real cDNA 
data. The input contained hybridization fingerprints of 
2329 cDNAs with 139 oligonucleotide probes. The 
clones originated from 18 different genes. The high- 
fidelity clustering, obtained by hybridization with long, 
unique sequences, is given in Table 1. Note the high 
variability in abundance of genes, ranging from over 
700 cDNAs to a single cDNA per gene. The HCS algo- 
rithm found 16 clusters and left 206 entities as single- 
tons. The results of the algorithm are shown in Fig. 1 
and Table 2. In 13 of the 16 clusters, over 92% of the 
entities belong to the same gene (Table 2). Therefore 
we call those clusters almost pure. Note the high level 
of noise (Fig. 1B). 

TABLE 1 

True Clusters and Gene Identities in the Real Data Set 

Note. Clusters are numbered by increasing size. For unidentified 
genes, only their cDNA number is given. 

To quantify the quality of a solution in comparison 
with the true clustering, we used the following measure: 
Represent a clustering solution of $n$ elements by 
an $n \times n$ symmetric matrix $C$, where $C_{ij} = 1$ if $i$ and $j$ 
belong to the same cluster and $C_{ij} = 0$ otherwise. Given 
such matrix representations of the true clustering $T$ 
and any clustering $C$ of the same data set, the 
Minkowski measure for the quality of $C$ (Sokal, 1977; 
Jardine and Sibson, 1971) is the normalized distance 
between the two matrices, $\|T - C\| / \|T\|$, where 
$\|T\| = \sqrt{\sum_i \sum_j T_{ij}}$. Since the matrices are binary, this is simply 
the number of pairs on which the two solutions 
disagree (i.e., they are clustered together in one solu- 
tion but not in the other), normalized according to the 
true solution. A perfect clustering would thus obtain 
the score zero. 
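
A small sketch of this score follows. It works on sets of co-clustered pairs instead of full matrices and assumes that the diagonal entries are ignored, so the score reduces to the square root of the number of disagreeing pairs divided by the number of truly co-clustered pairs; that treatment of the diagonal, and the function names, are assumptions.

```python
from itertools import combinations
from math import sqrt


def co_membership(clusters):
    """Unordered pairs of elements placed in the same cluster."""
    return {
        frozenset(pair)
        for cluster in clusters
        for pair in combinations(sorted(cluster), 2)
    }


def minkowski_score(true_clusters, found_clusters):
    """||T - C|| / ||T|| over off-diagonal entries; 0 is a perfect match."""
    t = co_membership(true_clusters)
    c = co_membership(found_clusters)
    if not t:
        raise ValueError("true clustering has no co-clustered pairs")
    # The symmetric difference holds the pairs the solutions disagree on.
    return sqrt(len(t ^ c) / len(t))
```

A solution that reproduces the true clustering scores 0, and larger values mean more pairwise disagreements relative to the size of the true solution.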

Table 3 summarizes the quality of clustering results 
of the algorithm in comparison with the true solution 
and with two other algorithms. The Minkowski score of 
the solution formed by the HCS algorithm is 0.71, 
compared to 0.77 of the Greedy algorithm. (To give 
maximum power to the Greedy algorithm, we tried 
several combinations of threshold values for the two 
phases of the algorithm and chose the combination that 
minimizes the Minkowski score. For the HCS, we chose 
a fixed threshold value in a blind manner and opti- 
mized the threshold only in the cluster-merging phase; 
see Materials and Methods.) The application of cluster 
merging was essential to obtaining a superior score. 
Although the gap in the Minkowski score is not large,
the solutions differ dramatically on other parameters:
Greedy generated an excessive number of clusters and
left more than twice as many singletons. Note that
neither algorithm assumes knowledge of the true
number of clusters.

As another measure of the solution quality, Table 3
also gives the number of false-positive and false-negative
errors that correspond to each of the three solutions,
as reflected in $G^{\theta}$. The true (reference) solution
had 70% missing edges ("false-negative" errors with
respect to the graph $G^{\theta}$) and 0.7% extra edges ("false
positives" in $G^{\theta}$). In comparison, HCS had 62% missing
edges and 0.77% extra edges. Greedy had the lowest
missing edges rate but obtained a substantially higher
extra edges rate.
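
The edge counts reported in Table 3 can be illustrated with a short sketch that follows the definitions in the table note. The function name, the nested-list similarity matrix, and the use of `None` for unclustered elements are hypothetical choices. The denominators (intracluster pairs for the missing rate, intercluster pairs for the extra rate) are an assumption that appears consistent with the counts in Table 3.

```python
from itertools import combinations


def edge_error_rates(similarity, labels, theta):
    """Count missing and extra edges of a clustering at threshold theta.

    similarity: symmetric n x n matrix (list of lists).
    labels: cluster label per element, or None for an unclustered singleton.
    """
    intra = inter = missing = extra = 0
    for i, j in combinations(range(len(labels)), 2):
        same_cluster = labels[i] is not None and labels[i] == labels[j]
        is_edge = similarity[i][j] > theta
        if same_cluster:
            intra += 1
            missing += not is_edge  # intracluster pair below the threshold
        else:
            inter += 1
            extra += is_edge        # intercluster pair above the threshold
    return {
        "total_edges": intra,
        "missing_edges": missing,
        "missing_rate": missing / intra if intra else 0.0,
        "extra_edges": extra,
        "extra_rate": extra / inter if inter else 0.0,
    }
```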

As a side remark, we suspect that cluster $T_{13}$ (40S
ribosomal protein S3), which was the most fragmented in
our solution (Table 2), is in fact incorrect. Its inhomogeneity
can also be seen in the similarity matrix
(Fig. 1B). (See Discussion for further comments.)

Simulation Results 

We tested the algorithm on simulated data gener- 
ated with varying noise levels. The detailed description 
of the simulation setup is given under Materials and 
Methods. Figure 2 summarizes the results of system- 
atic experiments with the HCS algorithm on simulated 
data. Figure 2A shows the performance of the algo- 
rithm for various sizes of problems. The results are 
consistently good, and they improve as the problem 
size increases, as the effect of the smaller clusters 
diminishes. Figure 2 also gives results of the Greedy 
algorithm of Milosavljevic et al. (1995) on the same 
problems. Since only binary fingerprints were gener- 
ated in the simulation, the cluster-merging phase was 
not used by either algorithm (see Discussion). The HCS
algorithm is consistently superior, and the difference 
in quality is up to an order of magnitude. Moreover, the 
solutions of the Greedy algorithm deteriorate as the 
problem size increases, while those of the HCS algo- 
rithm improve. The variance in the quality of the 
Greedy solutions is substantially larger, due to the 
extreme dependence of that algorithm on the starting 
point. 

The effective range of the number of probes (Fig. 2B) 
is 100-300. Increasing the expected false-positive hy- 
bridization rate up to 25% has a negligible effect on the 
quality of the results. In contrast, the false-negative
hybridization rate can be increased to 40-50% with 
little effect. Beyond these values, quality decreases 
rapidly as the error parameters increase. These results 
are quite encouraging, as the error rate in real large- 
scale hybridization experiments is quite high, but it 
falls within the range giving high-quality clustering 
results according to our experiments. 

DISCUSSION 

We have designed a novel algorithm for clustering
cDNA fingerprints and have tested it on real and simulated
expression data with very good results. Experience
with more real data examples would allow even
better tuning of the algorithm. The test reported here
was the first run of the algorithm on real data for
which the solution was known.

FIG. 1. Clustering results on real cDNA data. (A) The binarized similarity matrix S. A black point at position (i, j) indicates that $S_{ij} > 110$.
(B) Reordering of A according to the true clustering. cDNAs from the same true cluster appear consecutively, and black lines delineate
borders between different clusters. (C) Reordering of A according to the clustering produced by the HCS algorithm. cDNAs from the same
computed cluster appear consecutively, with no borderlines. Clusters are presented in the order of detection. (D) Reordering of A according
to the solution produced by the Greedy algorithm.

Although the solution generated by the algorithm is
not perfect, we argue that it is quite good and can serve
as a useful tool in gene discovery and expression analysis.
By sequencing a small sample from each cluster,
one can identify those clusters that are relatively pure
and concentrate on sequencing the nonpure and low-abundance
clusters only. Another strategy for using
the clustering results is to compute the average fingerprint
of each cluster and compare it to known gene
sequences from databases. This approach was successfully
demonstrated by Meier-Ewert et al. (1998) and
Poustka et al. (1999). Using that strategy, most known
genes can be detected without any sequencing. This
strategy can also detect impure clusters.
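
The averaging step of the second strategy can be sketched as follows. This is a minimal illustration only: the function name and the data layout (one numeric vector per clone, one label per clone, `None` for unclustered clones) are assumptions, and the subsequent comparison against database sequences is not shown.

```python
def average_fingerprints(fingerprints, labels):
    """Return the component-wise mean fingerprint of every cluster.

    fingerprints: list of equal-length numeric vectors, one per clone.
    labels: cluster label per clone; None marks an unclustered clone.
    """
    sums, counts = {}, {}
    for vector, label in zip(fingerprints, labels):
        if label is None:
            continue  # singletons do not contribute to any cluster average
        if label not in sums:
            sums[label] = [0.0] * len(vector)
            counts[label] = 0
        for k, value in enumerate(vector):
            sums[label][k] += value
        counts[label] += 1
    return {
        label: [total / counts[label] for total in vector_sum]
        for label, vector_sum in sums.items()
    }
```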

Our real data test used as the "true" reference a
clustering obtained by hybridization with long (~1000
nt) probes. Though not always completely accurate,
that clustering has high fidelity. The stringent hybridization
conditions permit the detection of highly complementary
sequences only, as false-positive matches
require significant homology over stretches of more
than 300 nt (Sambrook et al., 1989). Certainly, a perfectly
true solution would be to sequence fully all
cDNAs used in this analysis, but for quality assessment
this reference is quite adequate.

TABLE 2

Comparison of the Clusters Formed by the HCS Algorithm with the True Clusters

Note. $T_1, \ldots, T_{18}$: the true clusters. $C_1, \ldots, C_{16}$: clusters found by the HCS algorithm. S: singleton set. Position (i, j) is the number of clones
that belong to $C_i$ and $T_j$. Boldface numbers indicate 92% or more (95% or more except for $C_5$) of their row totals, indicating pure or almost
pure clusters.

The basic HCS algorithm repeatedly splits the set of 
clones using minimum cuts until a highly connected 
subgraph is formed. Such an algorithm has some prov- 
able properties that are desirable for clustering (Har- 
tuv and Shamir, submitted for publication). In partic- 
ular, it guarantees that every cluster has diameter 2, 
namely, each two elements in the same cluster are 
highly similar (i.e., with similarity level above θ) or are
both highly similar to a common third member of the 
cluster. In contrast, the union of any two subgraphs 
split by the algorithm is unlikely to manifest such 
cohesive properties. 
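
The recursive splitting described above can be sketched as follows. This is a hedged illustration, not the authors' implementation. It assumes the usual definition from the HCS literature, which is not stated in this section: a subgraph on n vertices is "highly connected" when its minimum cut contains more than n/2 edges. It also swaps in a plain Stoer-Wagner routine for the minimum cut and omits the low-degree heuristic and the cluster-merging phase. All function names are hypothetical.

```python
def stoer_wagner(nodes, edges):
    """Global minimum cut of an unweighted graph: (cut size, one side)."""
    w = {u: {} for u in nodes}
    for u, v in edges:
        w[u][v] = w[u].get(v, 0) + 1
        w[v][u] = w[v].get(u, 0) + 1
    merged = {u: {u} for u in nodes}
    best_value, best_side = float("inf"), set()
    while len(w) > 1:
        start = next(iter(w))
        attach = {v: w[start].get(v, 0) for v in w if v != start}
        s, t, cut_of_phase = None, start, 0
        while attach:  # maximum-adjacency ordering of one phase
            nxt = max(attach, key=attach.get)
            cut_of_phase = attach.pop(nxt)
            for v in attach:
                attach[v] += w[nxt].get(v, 0)
            s, t = t, nxt
        if cut_of_phase < best_value:
            best_value, best_side = cut_of_phase, set(merged[t])
        merged[s] |= merged.pop(t)  # contract t into s
        for v, weight in w.pop(t).items():
            del w[v][t]
            if v != s:
                w[s][v] = w[s].get(v, 0) + weight
                w[v][s] = w[s][v]
    return best_value, best_side


def hcs(nodes, edges):
    """Split by minimum cuts until every part is highly connected."""
    nodes = set(nodes)
    if len(nodes) < 2:
        return []  # single vertices are left as singletons, not clusters
    edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    cut_value, side = stoer_wagner(nodes, edges)
    if cut_value > len(nodes) / 2:  # assumed "highly connected" criterion
        return [nodes]
    return hcs(side, edges) + hcs(nodes - side, edges)


def has_diameter_at_most_2(cluster, edges):
    """Check that each pair is adjacent or shares a neighbor in the cluster."""
    nbrs = {u: set() for u in cluster}
    for u, v in edges:
        if u in nbrs and v in nbrs:
            nbrs[u].add(v)
            nbrs[v].add(u)
    return all(v in nbrs[u] or nbrs[u] & nbrs[v]
               for u in cluster for v in cluster if u != v)
```

Here the edge list plays the role of the similarity graph obtained by thresholding at θ. On two 4-cliques joined by a single edge, the sketch cuts the joining edge and reports each clique as a cluster.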

The low-degree heuristic is intended to speed up the
basic algorithm. Our experience shows that a judicious
choice of the degree sequence has a dramatic effect on
the running time with only a minor effect on the clustering
quality. For example, on the real data set discussed
above, different degree sequences reduce the
running time by a factor of 40 and yet lead to extremely
similar results (Hartuv, 1998). Selecting the appropriate
degree sequence requires some experience or experimentation
with the problem data. One useful aid
may be the knowledge of the correct clustering on some
small subset of the data. Such knowledge is often available
for fingerprint data, since some known genes are
used as internal controls to monitor the quality of the
hybridization process (Meier-Ewert et al., 1998).

A key parameter that influences the clustering quality
is the threshold θ. Like the degree sequence, a good
value of θ can be determined using a control subset.
Our simulations show that the range of θ values that
give near-optimal clustering results is quite wide (results
not shown). Consequently, a good value can be
rapidly found. The initial value of θ can be based on
prior knowledge of the abundance distribution and the
expected noise levels.

TABLE 3

Quality of Solutions Given by Three Algorithms on the 2329 Clones Data Set

Solution type       Total edges   Missing edges   Missing (%)   Extra edges   Extra (%)   Clusters   Minkowski   Singletons
True                    400,278         281,895            70        15,909        0.69         18
HCS without merge       220,434         114,810            52        28,668        1.15         17        0.83          206
HCS                     302,544         186,711            62        18,459        0.77         16        0.71          206
Greedy                  182,435          99,523            55        51,380        2.03         66        0.77          478

Note. Total edges: number of intracluster pairs of clones in the solution. Missing edges: intracluster pairs with similarity that falls below
the threshold. Extra edges: intercluster pairs that have similarity above the threshold. The threshold for each algorithm was set in a way
that optimizes its Minkowski score (the threshold used for the true solution was the same as for the HCS algorithm). True: the true clustering
solution. HCS without merge: the HCS algorithm without the cluster merging phase. HCS: the full algorithm. Greedy: the algorithm of
Milosavljevic et al. (1995). Minkowski: the score of solution quality (see text). Singletons: elements that are left unclustered by the algorithms.
Note the extremely high error rate in the true solution.

FIG. 2. Impact of problem parameters on the performance of the algorithm on simulated fingerprints, and comparison with the Greedy
algorithm. Cluster structure: 450 elements in nine clusters of sizes 10, 20, 30, ..., 90. The number of probes is 200. $\beta = 0.0015$, so the
expected rate of false-positive hybridizations is 25%. $\alpha = 0.4$, so the expected false-negative hybridization rate is 40%. L a = 500 nt; L„ = 2500
nt. Poisson rate for probe appearance is $\lambda = 0.005$. In each experiment one parameter value was changed while the rest were kept at the
default values. Averages and standard deviations are based on 50 simulations per data point. All results are for the Minkowski score. (A)
Impact of problem size and comparison to Greedy. Cluster sizes were 10, 20, 30, ..., 130, and the different problem sizes were obtained by
taking the first 3, 4, ..., 13 sizes. HCS: dotted line; Greedy: continuous line. Error bars denote 1 standard deviation. (B) Impact of the
number of probes. (C) Impact of false-positive rate. (D) Impact of false-negative rate.

In our simulation we assumed that all probes have 
the same rate, which is not satisfied by randomly cho- 
sen probes on real gene sequences. However, as was 
shown by Mayraz and Shamir (1999), this can be rem- 
edied by a judicious choice of probes. The results on the 
real data set also demonstrate that the effect of this 
problem is not large. 

The simulation we have performed demonstrates the
robustness of the algorithm to very high noise levels. It
is, however, limited to generating only binary fingerprints.
This handicaps to some extent the capabilities
of the HCS algorithm and to an even larger extent the
Greedy algorithm, which depends strongly on
real-valued fingerprints. Generating real-valued fingerprints
would be more realistic, but the complex process
of obtaining the real fingerprints is not easy to model in
simulation. The results of the algorithm with real-valued
fingerprints should be at least as good as with
binary fingerprints, since more information is exploited.

Additional improvements to the algorithm can be
achieved by using a faster minimum cut algorithm
(e.g., Karger, 1996) and by attempting to find maximal
highly connected subgraphs (e.g., using the cohesiveness
function of Matula, 1972). Using a weighted
minimum-cut algorithm may also improve the results.

A comparison of our algorithm with classical clustering
algorithms like k-means (Hartigan, 1975) may
be interesting. One clear advantage of our algorithm
is that the number of clusters need not be prespeci- 
fied. Comparison with classical hierarchical cluster- 
ing algorithms is also of interest. Finally, although 
the algorithm we developed was tested in the context 
of gene expression, its use is not limited to this 
application, and it can be used to solve other clus- 
tering problems. 